knitr::opts_chunk$set(echo = T)
knitr::opts_chunk$set(warning = F)
knitr::opts_chunk$set(comment = '')
knitr::opts_chunk$set(message=FALSE)
knitr::opts_chunk$set(eval = T)
knitr::opts_chunk$set(collapse = F)
# Libraries
library(ggplot2)
library(plotly)
library(signal)
library(ggmagnify)
library(dplyr)
library(tidyr)
library(ptw)
library(gridExtra)
library(zoo)
library(psych)
library(tibble)

From Lab to Strategy: Chemometrics for Smarter Manufacturing Decisions

Imagine a company that invests time, money, and talent in developing an innovative product. Everything seems to be going well… until, months later, quality control tests reveal that some batches are not the same as the first ones. What changed? Did the process fail? Was a raw material altered? These questions create uncertainty not only in the lab but also in production, logistics, marketing, and in the perception of the final customer. This is where traditional chemical analysis alone is no longer enough.

At the heart of this story is chromatographic analysis (HPLC), a key technique to evaluate product quality and performance in industries such as cosmetics, food, pharmaceuticals, and agriculture. However, when comparing multiple batches of the same product, subtle (or not so subtle) variations can arise: changes in peak shape, shifts in retention times, or small deviations that might go unnoticed… until they cause bigger issues. These changes can happen due to differences in sample preparation, pH, mobile phase composition, temperature, system flow, or even column wear. They may also result from poor manufacturing practices, incorrect raw material dosing, or procurement issues like buying non-equivalent ingredients.

This is where chemometrics comes in as a powerful tool that goes beyond the technical aspect. Applying multivariate algorithms like Principal Component Analysis (PCA) helps to detect hidden patterns, identify outlier batches, anticipate process failures, and most importantly, make informed, data-driven decisions. It’s not just about solving an analytical problem: it’s about integrating data science into the heart of quality control, automating processes with R, and transforming complex data into strategic information for the entire company.

This approach not only improves laboratory accuracy and efficiency but also positions the company as an innovative organization, capable of anticipating challenges and adopting cutting-edge competitive tools. From production to management, from quality assurance to product development, everyone benefits when science and analytics work together. Applied chemometrics is not just a technical solution: it’s a competitive advantage that drives operational excellence and strengthens confidence in every batch that reaches the market.

Chromatographic data simulation

To exemplify this kind of problem, I created an R function called croma() which simulates a chromatogram with multiple peaks, baseline noise, and the possibility of setting the retention time shift.

The function allows us to:

• Set the number of signals (peaks)

• Adjust the noise intensity in baseline

• Generate variations in time retentions among peaks

• Define parameters such as peak high, peak width, run time

croma <- function(time_run, numbers_signal, noise_level, wp,
                  jitter_rt = FALSE, jitter_range = 0.1,
                  tr_base = NULL, h_base = NULL, wp_base = NULL,
                  plot = TRUE) {
  
  if (wp < 0.05 || wp > 0.3) {
    wp <- runif(n = 1, min = 0.05, max = 0.3)
    cat('The width peak should be between 0.05 and 0.3, so the wp chosen is:', wp, "\n")
  }
  
  # Usar base si se da, si no generar aleatoriamente
  tr <- if (is.null(tr_base)) sample(x = 1.0:time_run, size = numbers_signal, replace = FALSE) else tr_base
  hight <- if (is.null(h_base)) runif(n = numbers_signal, min = 100, max = 1000) else h_base
  width_peak <- if (is.null(wp_base)) runif(n = numbers_signal, min = 0.05, max = wp) else wp_base

  # Jitter opcional
  if (jitter_rt) {
    tr <- tr + runif(numbers_signal, min = -jitter_range, max = jitter_range)
    tr <- pmin(pmax(tr, 0), time_run)
  }
  
  # Ruido
  noise <- if (noise_level == 0) {
    rnorm(n = 1000, mean = 0, sd = 0)
  } else {
    rnorm(n = 1000, mean = 1, sd = noise_level)
  }

  base_line <- seq(0, time_run, length.out = 1000)
  Abs <- rep(0, length(base_line))
  
  for (i in 1:numbers_signal) {
    Abs <- Abs + (dnorm(base_line, mean = tr[i], sd = width_peak[i]) * hight[i] + noise)
  }
  
  data_chromatogram <- data.frame(TR = round(base_line, 2), Abs = round(Abs, 2))
  if (plot) {
    plot(x = data_chromatogram$TR, y = data_chromatogram$Abs, type = 'l',
         xlab = 'Retention time', ylab = 'Aborbance', col = "maroon2")
    axis(1, at = seq(0, time_run, by = 1))
    grid()
  } else {
    return(data_chromatogram)
  }
}

Even though the generated data are simulated, they are designed to imitate real data obtained from HPLC laboratory analysis, including undesired variations among runs, which is useful to develop and validate analytical methods based on chemometrics.

SIMULATING MULTIPLE BATCHES

Once the croma() function is defined, the next step is to simulate the dataset of samples with different batches but the same product. The purpose is to show a real laboratory analysis scenario, where different HPLC runs of products, formulations, or raw materials are analyzed. Five batches (A–E) are generated with five samples per batch, each of them with four signals or peaks. These peaks were designed with retention times, heights, and peak widths specific for each batch.

• Aleatoric variations like noise or jitter were introduced to:

• Instrumental noise in the baseline

• Retention time shifts due to chromatographic condition variations

• Subtle differences in peak shape and intensity

# Creating Sample A
number_of_signals <- 5
set.seed(320)
number_of_peak <- 4
tr_A <- sample(1.5:10, size = number_of_peak) 
h_A <- runif(number_of_peak, min = 200, max = 650) 
wp_A <- runif(number_of_peak, min = 0.05, max = 0.06) 

dataset_lot_A <- list()
noise_val <- sample(seq(0,0.5, by = 0.01), size = 5, replace = FALSE)

for(i in 1:number_of_signals){
  dataset_lot_A[[i]] <- croma(
    time_run = 10,
    numbers_signal = number_of_peak,
    noise_level = noise_val[i],
    wp = max(wp_A),
    jitter_rt = TRUE,
    jitter_range = 0.2,
    tr_base = tr_A, 
    h_base = h_A, 
    wp_base = wp_A, 
    plot = FALSE
  )
  dataset_lot_A[[i]]$Sample <- paste0('SampleA_', i)
  dataset_lot_A[[i]]$lot <- 'lot_A'
}
dataset_lot_A <- do.call(rbind, dataset_lot_A)


# Creating Sample B
set.seed(320)
number_of_peak <- 4
tr_B <- sample(1.5:10, size = number_of_peak) 
h_B <- runif(number_of_peak,  min = 120, max = 580) 
wp_B <- runif(number_of_peak,  min = 0.05, max = 0.06) 

dataset_lot_B <- list()
noise_val <- sample(seq(0,0.5, by = 0.01), size = 5, replace = FALSE)

for(i in 1:number_of_signals){
  dataset_lot_B[[i]] <- croma(
    time_run = 10,
    numbers_signal = number_of_peak,
    noise_level = noise_val[i],
    wp = max(wp_A),
    jitter_rt = TRUE,
    jitter_range = 0.15,
    tr_base = tr_B, 
    h_base = h_B, 
    wp_base = wp_B,
    plot = FALSE
  )
  dataset_lot_B[[i]]$Sample <- paste0('SampleB_', i)
  dataset_lot_B[[i]]$lot <- 'lot_B'
}
dataset_lot_B <- do.call(rbind, dataset_lot_B)

# Creating Sample C
set.seed(320)
number_of_peak <- 4
tr_C <- sample(1.5:10, size = number_of_peak) 
h_C <- runif(number_of_peak, min = 200, max = 650) 
wp_C <- runif(number_of_peak,  min = 0.05, max = 0.06) 

dataset_lot_C <- list()
noise_val <- sample(seq(0,1, by = 0.01), size = 5, replace = FALSE)

for(i in 1:number_of_signals){
  dataset_lot_C[[i]] <- croma(
    time_run = 10,
    numbers_signal = number_of_peak,
    noise_level = noise_val[i],
    wp = max(wp_C),
    jitter_rt = TRUE,
    jitter_range = 0.1,
    tr_base = tr_C, 
    h_base = h_C, 
    wp_base = wp_C, 
    plot = FALSE
  )
  dataset_lot_C[[i]]$Sample <- paste0('SampleC_', i)
  dataset_lot_C[[i]]$lot <- 'lot_C'
}
dataset_lot_C <- do.call(rbind, dataset_lot_C)

# Creating Sample D
set.seed(320)
number_of_peak <- 4
tr_D <- sample(1.5:10, size = number_of_peak) 
h_D <- runif(number_of_peak, min = 200, max = 650)
wp_D <- runif(number_of_peak,  min = 0.05, max = 0.06) 

dataset_lot_D <- list()
noise_val <- sample(seq(0.2,0.5, by = 0.01), size = 5, replace = FALSE)

for(i in 1:number_of_signals){
  dataset_lot_D[[i]] <- croma(
    time_run = 10,
    numbers_signal = number_of_peak,
    noise_level = noise_val[i],
    wp = max(wp_D),
    jitter_rt = TRUE,
    jitter_range = 0.22,
    tr_base = tr_D, 
    h_base = h_D, 
    wp_base = wp_D, 
    plot = FALSE
  )
  dataset_lot_D[[i]]$Sample <- paste0('SampleD_', i)
  dataset_lot_D[[i]]$lot <- 'lot_D'
}
dataset_lot_D <- do.call(rbind, dataset_lot_D)

# Creating Sample E
set.seed(320)
number_of_peak <- 4
tr_E <- sample(1.5:10, size = number_of_peak) 
h_E <- runif(number_of_peak, min = 120, max = 580)
wp_E <- runif(number_of_peak,  min = 0.05, max = 0.06) 

dataset_lot_E <- list()
noise_val <- sample(seq(0,0.9, by = 0.01), size = 5, replace = FALSE)

for(i in 1:number_of_signals){
  dataset_lot_E[[i]] <- croma(
    time_run = 10,
    numbers_signal = number_of_peak,
    noise_level = noise_val[i],
    wp = max(wp_E),
    jitter_rt = TRUE,
    jitter_range = 0.12,
    tr_base = tr_E, 
    h_base = h_E, 
    wp_base = wp_E, 
    plot = FALSE
  )
  dataset_lot_E[[i]]$Sample <- paste0('SampleE_', i)
  dataset_lot_E[[i]]$lot <- 'lot_E'
}
dataset_lot_E <- do.call(rbind, dataset_lot_E)

Chromatogram visualizations

The next graph shows the twenty-five overlapped chromatograms:

• All batches share a similar peak structure among them.

• Batches A, C, and D show similar behavior among them.

• Batches B and E show structural differences compared to the previous ones, which is rather important to evidence possible clusters or anomalies in multivariate analysis.

• The baseline noise is subtle but noticeable and will be corrected in the next steps.

In the industry and daily real work, as I mentioned earlier, those variations are due to many factors (raw material differences, changes in equipment or method conditions, batches out of specifications, expected variability among runs). That’s why analyzing these chromatographic profiles and behaviors allows us not only to detect deviations, issues, or problems, but also to identify repetitive patterns, making processes easier and more optimized, reducing time and costs, and ensuring product quality.

chromaDF <- rbind(dataset_lot_A, dataset_lot_B, dataset_lot_C, dataset_lot_D,dataset_lot_E)

p1 <- ggplot(data = chromaDF , aes(x = TR, y = Abs, colour = Sample)) +
  geom_line(linewidth = 0.1) + 
  ggtitle(label = 'Chromatogram of 4 compounds with noise') +
  theme_minimal()
ggplotly(p1)

MOSTRAR EL RUIDO DE LA BASE LINEA AL ESCRIBIR EN MEDIUM

Smoothing chromatogram with sgolayfilt()

To enhance the quality of chromatograms, I used the Savitzky-Golay (sgolayfilt) smoothing function. This is a common technique applied in spectroscopy and chromatography to remove noise without distorting the shape of the peaks.

chromaDF_split <- split(chromaDF, chromaDF$Sample)

abs_corr <- lapply(chromaDF_split, function(df){
  abs_vals <- signal::sgolayfilt(df$Abs, p = 4, n = 23)
  df$Abs_smooth <- round(as.numeric(abs_vals),2)
  return(df)
})

chromaCorDF <- do.call(rbind, abs_corr)

p2 <- ggplot(data = chromaCorDF, aes(x = TR, y = Abs_smooth, colour = Sample)) +
  geom_line(linewidth = 0.1) + 
  ggtitle(label = 'Chromatogram of 4 compounds with reduced noise') +
  theme_minimal()
ggplotly(p2)
from <- c(xmin = 3.5, xmax = 5.40, ymin = -25, ymax = 40)
to <- c(0, 5 , 2200, 3500)

gridExtra::grid.arrange((p1 + theme(legend.position = "none", panel.grid = element_blank()) +
                           ggmagnify::geom_magnify(from = from, 
                                                   to = to, shadow = F,axes = "xy",
                                                   proj = "single",colour = "black", 
                                                   linetype = 1)),
                        (p2 + theme(legend.position = "none", panel.grid = element_blank()) +
                           ggmagnify::geom_magnify(from = from, 
                                                   to = to, shadow = F,axes = "xy",
                                                   proj = "single",colour = "black", 
                                                   linetype = 1)),
                        ncol = 2) 

The comparison plot shows that the noise was significantly reduced by the previous procedure. This step is crucial to improve visualization and simplify further steps like alignment, normalization, or multivariate analysis.

Also, it allows us to reduce interpretation errors caused by noise, improve the accuracy of compound quantification, and automate decision-making based on cleaned data.

Peak aligning function

The next code corrects retention time deviations among different samples to facilitate more accurate comparisons. I used the ptw() function (Piecewise Time Warping) to align peaks at specific retention time windows, breaking down the chromatogram into as many peaks as it has. I created a function called align_pecks() that takes the first chromatogram as a reference, then applies smoothing and individual warping. Also, it allows us to compare graphically the before and after of a chromatogram.

chromaCorDF_w <- chromaCorDF %>% 
  select(-Abs) %>% 
  pivot_wider(names_from = TR, values_from = Abs_smooth, names_prefix = 'Min_')


align_pecks <- function(df, tri , trf, plot = F, sample_n = F){
  if(typeof(tri)!='character'|typeof(trf)!='character'){
    print('The TR must be a string') }
  section <- df[,which(colnames(df) == tri):which(colnames(df) == trf)]
  reference <- as.matrix(section[1,])
  # return(list(section,reference))
  aligned_df <- lapply(X = 2:nrow(section), FUN = function(i){
    ptw(ref = reference, samp = as.matrix(section[i,]),
        warp.type = "individual",      # Suaviza la deformación
        smooth.param = 5e3,
        optim.crit = "RMS",         # Usa correlación cruzada ponderada
        init.coef = c(0, 1, 0)      # Pequeño ajuste inicial
        )$warped.sample   
  })
  #aligned_df <- do.call(rbind,aligned_df)
  aligned_df <- data.frame(do.call(rbind,aligned_df) )
  
  if(plot ==T){
    plot(t(reference), type = 'l', xlab = 'TR', ylab ='Abs',
         main = 'Plot Peaks aligned', 
         sub = paste0('Compared sample: ', df[1:nrow(df),][sample_n,1])
         )
    lines(t(aligned_df[sample_n,]), type = 'l', col = 'red', lwd = 2)
    lines(t(df[sample_n,which(colnames(df) == tri):which(colnames(df) == trf)]),type = 'h', col = 'blue')
    legend('topleft', legend = c('Ref', 'Aligned', 'Bef_alig'), col = c('black','red','blue'),
           title = 'Samples', pch = 20)
    }
  return(aligned_df)
}

This approach not only improves data analysis, but also has a wide impact in real lab applications. In quality assurance, it helps to compare specific peaks between batches or formulations, increasing confidence in detecting impurities or variations. The automation of peak alignment reduces manual work and human errors. In production, it helps identify retention time shifts caused by changes in pH, temperature, or column condition, which improves process control. It also reduces costs by avoiding unnecessary reruns, saving solvents, machine time, and staff effort. Finally, this method supports compliance by generating reproducible chromatogram reports that make regulatory validation easier.

par(mfrow=c(2,2)) 
section_1 <- align_pecks(df = chromaCorDF_w, tri = 'Min_0', 
                         trf = 'Min_3', plot = T, sample_n = 14)
section_2 <- align_pecks(df = chromaCorDF_w, tri = 'Min_3.01', 
                         trf ='Min_6',  plot = T, sample_n = 14)
section_3 <- align_pecks(df = chromaCorDF_w, tri = 'Min_6.01', 
                         trf = 'Min_7', plot = T, sample_n = 14)
section_4 <- align_pecks(df = chromaCorDF_w, tri = 'Min_7.01', 
                         trf = 'Min_10', plot = T, sample_n = 14)

par(mfrow=c(1,1)) 

alignedDF <- cbind(section_1,section_2,section_3,section_4)
chromaCorAligDF_w <- chromaCorDF_w

chromaCorAligDF_w[2:nrow(chromaCorAligDF_w),3:ncol(chromaCorAligDF_w)] <- alignedDF

In the above graphs, we can see how the peak shifts were corrected by the function. We can notice how the signals looked before being aligned (blue bars) and after the procedure was applied. The black line represents the reference chromatogram, and the red line shows the aligned chromatogram.

This code significantly improves the quality of chromatographic preprocessing, allowing robust comparative analysis. It is a corporate tool for any laboratory that works with complex mixture sample analysis, reducing errors and time, and improving operational efficiency.

Tackling NAs

After aligning the chromatograms, some missing values are generated by the process. This happens due to the deformation of the peaks to match the reference chromatogram—some points are moved out of the original range, leaving some spaces without data.

cat('There are',sum(is.na(chromaCorAligDF_w)),'missing values')
There are 631 missing values

To solve this problem, I applied linear interpolation with the na.approx() function, which computes the missing values by joining the neighboring points with a straight line. I mean, the missing values are calculated following the slope between the previous point and the next one.

chromaCorAligDF_w[,3:ncol(chromaCorAligDF_w)] <- t(
  apply(
    X = chromaCorAligDF_w[,3:ncol(chromaCorAligDF_w)], 
    MARGIN = 1, FUN = function(x){
  x_interp <- zoo::na.approx(x, na.rm = F)
  x_filled <- zoo::na.locf(x_interp, na.rm = F) 
  x_filled <- zoo::na.locf(x_filled, fromLast = T) 
  return(x_filled)
}))

Whether the NA appears at the beginning or at the end (where there aren’t two points to interpolate), it is handled using the na.locf() function, which fills the missing value with the last known one, either forward or backward.

cat('After handling missing values there are',sum(is.na(chromaCorAligDF_w)), 'NAs')
After handling missing values there are 0 NAs

With this procedure, I recover the matrix without missing values, ready for further analysis.

p3 <- chromaCorAligDF_w %>% 
  gather(key = 'TR', value = 'Abs', 3:1002) %>% 
  mutate(TR = as.double(gsub('Min_', '', TR))) %>% 
  ggplot(aes(x = TR, y = Abs, colour = Sample)) +
  geom_line(linewidth = 0.1) + 
  theme_minimal()
ggplotly(p3)
gridExtra::grid.arrange(

chromaCorDF %>% 
  filter(TR >= 4.5 & TR < 6.4 & stringr::str_detect(Sample, 'SampleA|SampleC|SampleD')) %>% 
  ggplot(aes(x = TR, y = Abs_smooth, colour = Sample)) +
  geom_line(linewidth = 0.1) + 
  ggtitle(label = 'Second signal reduced noise, batches A,C,D') +
  theme_minimal() + theme(legend.position = "none", panel.grid = element_blank()),

chromaCorAligDF_w %>% 
  gather(key = 'TR', value = 'Abs_smooth', 3:1002) %>% 
  mutate(TR = as.double(gsub('Min_', '', TR))) %>% 
  filter(stringr::str_detect(Sample, 'SampleA|SampleC|SampleD') & TR >= 4.5 & TR < 6.4) %>% 
  ggplot(aes(x = TR, y = Abs_smooth, colour = Sample)) +
  geom_line(linewidth = 0.1) + 
  ggtitle(label = 'Second signal aligned, batches A,C,D') +
  theme_minimal() + theme(legend.position = "none", panel.grid = element_blank()),

chromaCorDF %>% 
  filter(TR >= 4.5 & TR < 6.4 & stringr::str_detect(Sample, 'SampleB|SampleE')) %>% 
  ggplot(aes(x = TR, y = Abs_smooth, colour = Sample)) +
  geom_line(linewidth = 0.1) + 
  ggtitle(label = 'Second signal reduced noise, batches B,E') +
  theme_minimal() + theme(legend.position = "none", panel.grid = element_blank()),

chromaCorAligDF_w %>% 
  gather(key = 'TR', value = 'Abs_smooth', 3:1002) %>% 
  mutate(TR = as.double(gsub('Min_', '', TR))) %>% 
  filter(stringr::str_detect(Sample, 'SampleB|SampleE') & TR >= 4.5 & TR < 6.4) %>% 
  ggplot(aes(x = TR, y = Abs_smooth, colour = Sample)) +
  geom_line(linewidth = 0.1) + 
  ggtitle(label = 'Second signal aligned, batches B,E') +
  theme_minimal() + theme(legend.position = "none", panel.grid = element_blank()),

 ncol = 2, nrow = 2)

It’s great how we can align the chromatogram’s peaks to make comparisons easier and more accurate. This helps us detect differences between samples, even if they were slightly shifted. It improves the quality of the analysis and supports better decisions.

Discretization or binning

In order to compare chromatograms in a more structured way, I divided each signal into sections according to its retention time: areas without peaks (flat sections) and areas with peaks. This segmentation was done using manually defined time intervals. Then, the area under the curve was calculated for each segment using the trapezoid formula. This step converts the signals so they can be compared to each other, facilitating statistical analysis, automated quality control, and deviation detection among batches.

spectrum_acp <- chromaCorAligDF_w %>% 
  gather(key = 'TR', value = 'Abs', 3:1002) %>% 
  mutate(TR = as.double(gsub('Min_', '', TR))) %>% 
  mutate(Minute = case_when(
    TR >= 0 & TR < 2.1 ~ 'section_1',
    TR >= 2.1 & TR < 3 ~ 'peak_1',
    TR >= 3 & TR < 5.01 ~ 'section_2',
    TR >= 5.01 & TR < 5.9 ~ 'peak_2',
    TR >= 5.9 & TR < 6.9 ~ 'peak_3',
    TR >= 6.9 & TR < 8 ~ 'section_3',
    TR >= 8 & TR < 9 ~ 'peak_4',
    TR >= 9 & TR < 10 ~ 'section_4',
    TRUE ~ 'Unknown'
  )) %>% 
  group_by(lot, Sample,Minute) %>% 
  summarise(area_peak = sum(Abs * (TR[2]-TR[1]), na.rm = T), .groups = 'drop') %>% 
  pivot_wider(names_from = Minute, values_from = area_peak, names_prefix = 'Area_') %>% 
  select(-Area_Unknown, -Area_section_1,-Area_section_2,-Area_section_3,-Area_section_4)

spectrum_acp
# A tibble: 25 × 6
   lot   Sample    Area_peak_1 Area_peak_2 Area_peak_3 Area_peak_4
   <chr> <chr>           <dbl>       <dbl>       <dbl>       <dbl>
 1 lot_A SampleA_1        285.        337.        233.        554.
 2 lot_A SampleA_2        285.        336.        238.        554.
 3 lot_A SampleA_3        286.        337.        232.        554.
 4 lot_A SampleA_4        290.        334.        230.        554.
 5 lot_A SampleA_5        285.        337.        230.        554.
 6 lot_B SampleB_1        242.        298.        187.        520.
 7 lot_B SampleB_2        244.        298.        193.        521.
 8 lot_B SampleB_3        244.        298.        188.        520.
 9 lot_B SampleB_4        242.        297.        187.        520.
10 lot_B SampleB_5        244.        298.        186.        520.
# ℹ 15 more rows

Aeras by peck, batch and Sample

The area under the curve of each signal is shown in the graph below for each sample and each batch. This allows us to visualize how the intensity of each signal varies in a general way among batches, using a bar plot and organized by peaks.

p4 <- spectrum_acp %>% 
  gather(key = 'Area', value = 'value', 3:6) %>% 
  ggplot(aes(x = Sample, y = value, colour = lot, fill = lot)) +
  geom_bar(stat = 'identity',position="dodge") +
  facet_wrap(~Area, scales = "free_x") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0)) +
  theme(legend.position = "none", panel.spacing.y = unit(3, "cm"))
  
ggplotly(p4)

PCA Assumptions

# Bartlett test
print('########## Prueba de Bartlett')
[1] "########## Prueba de Bartlett"
spectrum_acp %>% 
  select(-lot, -Sample) %>% 
  scale(center = T, scale = T) %>% 
  cortest.bartlett(n = 30)
$chisq
[1] 356.2335

$p.value
[1] 7.08148e-74

$df
[1] 6
print('########## KMO (Kaiser-Meyer-Olkin) test')
[1] "########## KMO (Kaiser-Meyer-Olkin) test"
# KMO (Kaiser-Meyer-Olkin) test
spectrum_acp %>% 
  select(-lot, -Sample) %>% 
  scale(center = T, scale = T) %>% 
  cor() %>% 
  KMO()
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = .)
Overall MSA =  0.75
MSA for each item = 
Area_peak_1 Area_peak_2 Area_peak_3 Area_peak_4 
       0.72        0.71        0.99        0.65 
print('########## Determinant test')
[1] "########## Determinant test"
# Determinant test
spectrum_acp %>% 
  select(-lot, -Sample) %>% 
  scale(center = T, scale = T) %>% 
  cor() %>% 
  det()
[1] 8.204152e-08

Before applying Principal Component Analysis (PCA), we verified whether the data structure justified dimensionality reduction:

Bartlett’s Test Indicates whether variables are sufficiently correlated. • Chi-squared = 356.23 • p-value = 7.08e-74 Interpretation: Variables show significant correlations. PCA is justified.

KMO Index Measures sampling adequacy. • Overall KMO = 0.75 • All variables between 0.65 and 0.99 Interpretation: The structure is adequate for PCA.

Determinant of the Correlation Matrix • Value = 8.2e-08 Interpretation: High multicollinearity. Dimensionality reduction is appropriate.

Conclusion: The results confirm that PCA is appropriate for this dataset. The variables are correlated and share enough structure to be summarized by components.

Perform the NIPALS ACP algorithm

After confirming the PCA assumptions, I decided to code my own PCA function, called niplasJDM(), to control each step of the analysis. This approach allows full transparency throughout the process, enabling a better understanding of the data’s behavior and the flexibility to adapt the analysis to specific project needs.

nipalsJDM <- function(df, n_comp, scale = F){
  # Mean centering
  X <- scale(df, center = T, scale = scale)
  
  #total variance, compute covariance
  SStot <- sum(X^2)
  # Prepare matrixes for soceres, loadings
  scores <- matrix(0, nrow = nrow(X), ncol = n_comp)
  loads <- matrix(0, nrow = ncol(X), ncol = n_comp)
  expvar <- rep(0, n_comp) # explained variance
  resvar <- rep(0, n_comp)# Residuals variace
  acumvar <- rep(0, n_comp) # Acumulated variance
  for(a in 1:n_comp){
    # Initialization
    col.ind <- which.max(apply(X = X, MARGIN = 2, FUN = sd))
    t <- X[,col.ind, drop = F]
    for(i in 1:30){
      # compute the loadings
      p <- crossprod(X,t)
      p <- p/sqrt(sum(p^2))
      # Compute scores
      t <- X%*%p
    }
    scores[,a] <- t
    loads[,a] <- p
    tpt <- tcrossprod(t,p)
    X <-  X - tpt 
    
    expvar[a] <- sum(tpt^2) / SStot
  }
  resvar <- 1 - expvar
  acumvar <- cumsum(expvar)
  colnames(scores) <- colnames(loads) <- paste0("PC", 1:n_comp)
  rownames(scores) <- rownames(df)
  rownames(loads) <- colnames(df)
  
  names(expvar) <- names(resvar) <- names(acumvar) <- paste0("PC", 1:n_comp)
  return(list(scores = scores, loads = loads, expvar = expvar, resvar = resvar, acumvar = acumvar))
}

What does the function do? Centers (and optionally scales) the data. Computes the scores (the new coordinates of the samples on each component). Computes the loadings (how strongly the original variables relate to each component). Estimates the variance explained by each component. Calculates the residual variance (what the model does not explain). Computes the cumulative variance.

With my nipalsJDM function ready, I applied PCA to the area data. First, I converted the table to a data.frame, set the sample names as row identifiers, removed the batch column (since it is not numerical), and scaled the variables because the area values were in different magnitudes. I selected three principal components, which were enough to capture most of the information without making the model too complex.

acp_spec <- spectrum_acp %>% 
  as.data.frame() %>% 
  tibble::column_to_rownames(var = 'Sample') %>% 
  select(-lot) %>% 
  nipalsJDM(n_comp = 3, scale = T) 

acp_spec
$scores
                PC1          PC2           PC3
SampleA_1  1.620104  0.046776807 -0.0411165327
SampleA_2  1.733355  0.237747117  0.0038200404
SampleA_3  1.598999 -0.018421850 -0.0255827809
SampleA_4  1.586538 -0.134737031  0.2362354173
SampleA_5  1.558993 -0.075641880 -0.0409808764
SampleB_1 -2.461898 -0.077077224 -0.0458738710
SampleB_2 -2.274731  0.154974378  0.0261724980
SampleB_3 -2.392854 -0.062346383  0.0012772444
SampleB_4 -2.471428 -0.026404015 -0.0240374622
SampleB_5 -2.432145 -0.116228809 -0.0059355325
SampleC_1  1.495025 -0.086337341 -0.0783834172
SampleC_2  1.511293 -0.121749735 -0.0394024790
SampleC_3  1.602810 -0.018894353 -0.0380762822
SampleC_4  1.508241 -0.038606307 -0.0405100673
SampleC_5  1.591658 -0.022364096 -0.0305402380
SampleD_1  1.665264  0.076744699  0.0009169974
SampleD_2  1.739516  0.240025610 -0.0001800921
SampleD_3  1.595134 -0.023663599 -0.0283514545
SampleD_4  1.545264 -0.075153316  0.1552730667
SampleD_5  1.579149 -0.029757800 -0.0393202323
SampleE_1 -2.398516  0.015923032  0.0618254139
SampleE_2 -2.426996  0.004882514 -0.0575302411
SampleE_3 -2.246162  0.265272194  0.0549949874
SampleE_4 -2.404328 -0.053182502  0.0145035980
SampleE_5 -2.422284 -0.061780110 -0.0191977042

$loads
                  PC1        PC2         PC3
Area_peak_1 0.5000946 -0.3635237  0.70564730
Area_peak_2 0.5003600 -0.2246274 -0.69695279
Area_peak_3 0.4987372  0.8623999  0.08626321
Area_peak_4 0.5008058 -0.2714027 -0.09421971

$expvar
        PC1         PC2         PC3 
0.995583095 0.003161358 0.001168400 

$resvar
        PC1         PC2         PC3 
0.004416905 0.996838642 0.998831600 

$acumvar
      PC1       PC2       PC3 
0.9955831 0.9987445 0.9999129 

PCA GRAPHICS

Before plotting the PCA results, I prepared some variables to make the visualization easier. First, I converted the batch column into a factor, so I could assign each one a different color and shape automatically. I used the rainbow() palette for colors and selected different symbols (pch) to identify each batch clearly. I also created a positions variable to place the variable names when plotting loadings, and calculated a proportional scale to show both scores and loadings on the same plot without distortion.

# Create a factor with the batches
lots <- factor(spectrum_acp$lot)
# Automatically assign colors to each batch
color_lots <- rainbow(n = 6)
color <- color_lots[as.numeric(lots)]

pch_lots <- c(16, 17, 15, 18, 19, 20)  # Change if you want others
# Take the batches in the correct order
lots <- factor(spectrum_acp$lot)  # already ordered by the PCA samples
# Assign pch according to batch
pchs <- pch_lots[as.numeric(lots)]

positions <- rep(1:4, length.out = nrow(acp_spec$scores))
escala <- max(abs(acp_spec$scores)) / max(abs(acp_spec$loads))
plot(acp_spec$scores[,1], acp_spec$scores[,2], 
     xlab = paste0('PC1 ',round(acp_spec$expvar[1]*100,2),'%'),
     ylab = paste0('PC2 ', round(acp_spec$expvar[2]*100,2), '%'), 
     col = color, main = "PCA - Scores + Loadings \n", col.main="black",
     sub = paste0('Variance explained: ',round((acp_spec$expvar[1]*100) + (acp_spec$expvar[2]*100),2), '%'),
     col.sub="black",
     xlim = c(-2.5,2.5), ylim = c(-2.5,2.5), pch = pchs)

text(acp_spec$scores[,1], acp_spec$scores[,2], labels = rownames(acp_spec$scores), cex = 0.5, pos = positions, offset = 0.3)

arrows(x0 = 0,y0 = 0, x1 = acp_spec$loads[,1]*escala, y1 = acp_spec$loads[,2]*escala, col = "red", length =0.1)
text(acp_spec$loads[,1]*escala,  acp_spec$loads[,2]*escala, labels = rownames(acp_spec$loads), cex = 0.6, col = 'red',  
     pos = positions, offset = 0.3)

legend('topright', legend = levels(lots), col = color_lots[1:length(levels(lots))],
      pch =  pch_lots[1:length(levels(lots))], title = 'Lots', cex = 0.6)

# 2D plot
# Getting batches, colors, pch factors
batches <- factor(spectrum_acp$lot)
colors_batch <- rainbow(n = length(unique(batches)))
colorize_batch <- colors_batch[as.numeric(batches)]

pch <- seq(1,length(unique(batches)))
pch_batches <- pch[as.numeric(batches)]
positions <- rep(1:4, length = nrow(acp_spec$scores))
scale_loadings <- max(abs(acp_spec$scores))/max(abs(acp_spec$loads))

df_scores <- acp_spec$scores %>% 
  as.data.frame() %>% 
  rownames_to_column(var = 'Sample')
df_scores$Color <- colorize_batch
df_scores$PCH <- pch_batches
df_scores$batches <- batches

p5 <- plot_ly(data = df_scores, x = ~PC1, y = ~PC2,
        text = ~Sample, color = ~batches,
        colors = ~colorize_batch,
        type = 'scatter', mode = "markers+text",
        marker = list(symbol = ~pch_batches, size = 10),
        textposition = 'topright'
) %>% 
  layout(
    title = list(text = "PCA - Scores + Loadings", font = list(color = "gray40")),
    xaxis = list(title = paste0('PC1 (', round(acp_spec$expvar[1] * 100, 2), '%)')),
    yaxis = list(title = paste0('PC2 (', round(acp_spec$expvar[2] * 100, 2), '%)')),
    legend = list(title = list(text = "Lotes")),
    margin = list(t = 60)
  ) 

# adding loadings' arros

loadings <- as.data.frame(acp_spec$loads * scale_loadings ) %>% 
  rownames_to_column(var = 'Variable')
p5 <- p5 %>% 
  add_segments(
    data = loadings,
    x = 0, y = 0,
    xend = ~PC1, yend = ~PC2,
    line = list(color = 'red4'),
    inherit = F
  ) %>% 
  add_text(
    data = loadings,
    x = ~PC1, y = ~PC2,
    text = ~Variable,
    textposition = "top right",
    textfont = list(color = 'red',size =10),
    inherit = F
  )

# 3D plot
df_scores <- as.data.frame(acp_spec$scores)
df_scores$Sample <- rownames(acp_spec$scores)
df_scores$Color <- color
df_scores$PCH <- pchs
df_scores$Lote <- lots


fig <- plot_ly(data = df_scores, x = ~PC1, y = ~PC2, z = ~PC3, type = 'scatter3d',
                 text = ~Sample, color = ~Lote, colors = color_lots,  
               mode = "markers", marker = list(size = 5), textposition = "top right")

loadings <- as.data.frame(acp_spec$loads)
loadings$Variable <- rownames(acp_spec$loads)

for (i in rownames(loadings)) {
  fig <- fig %>% add_trace(
    data = NULL,
    x = c(0, loadings[i, 'PC1'] * scale_loadings),
    y = c(0, loadings[i, 'PC2'] * scale_loadings),
    z = c(0, loadings[i, 'PC3'] * scale_loadings),
    type = 'scatter3d',
    mode = "lines+text",
    line = list(color = 'gray20', width = 2),
    text = i,
    name = i,
    showlegend = T,
    color = I("gray20"),  
    inherit = FALSE 
  )
}

fig <- fig %>% layout(
  scene = list(
    xaxis = list(showgrid = FALSE),
    yaxis = list(showgrid = FALSE),
    zaxis = list(showgrid = FALSE),
    dragmode = "turntable"
  )
)


p5
fig

After calculating the PCA, I used different types of plots to make the results easier to interpret: • 2D plot (Base R): I used a basic 2D plot to show the samples projected on PC1 and PC2. Points are colored and shaped by batch. Red arrows show the loadings (the direction and importance of each variable). This helps to see how samples group and which variables are influencing the separation. • Interactive 2D plot (Plotly): I created a dynamic version with Plotly and ggplot2. It lets you interact with the points, see sample names, highlight batches, and understand variable directions. This is useful when there are many samples. • 3D plot (Plotly): Finally, I made a 3D plot showing the first three principal components. It helps to better understand the full structure of the data. I also added the loadings in 3D to see how each variable contributes to the components.

Example of PCA interpretation from UV spectral data

acp_spec$loads %>%
  as.data.frame() %>% 
  rownames_to_column(var = 'Area_Peak') %>% 
  gather(key = 'PC', value = 'Area', 2:4) %>% 
  ggplot(aes(x = Area_Peak, y = Area, colour = Area_Peak, fill = Area_Peak)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  geom_text(aes(label = round(Area,2)), colour = 'black',
            position = position_dodge(width = 0.9),
            hjust =0, vjust = -0.5, size = 4.5, angle = 0) +
  facet_wrap(~PC,  scales = "free_x") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0)) +
  theme(legend.position = "none")

In this PCA example, Principal Component 1 (PC1) shows that all peaks increase together. This suggests that PC1 represents a general concentration trend — samples with higher PC1 values have higher amounts of all compounds. In a lab or production setting, this could help identify batches with overall high concentration, which may indicate overdosed or enriched formulations.

Principal Component 2 (PC2) shows an opposite trend: when one specific peak (e.g., peak 3) increases, the others decrease. This means some ingredients replace others. If peak 3 represents an expensive UV filter, this could show which samples belong to a premium product line. It can also help detect mistakes in cheaper formulations.

Principal Component 3 (PC3) shows a conflict between two peaks: when one goes up, the other goes down. This may suggest that two ingredients are not used together — maybe one is natural and the other synthetic. This type of information is useful to improve formulations, reduce costs, or make the product more stable and sustainable.

Example of ANOVA applied to PCA results:

To explore whether differences exist between production batches, we analyzed PC1 values by lot using ANOVA. First, we checked the assumptions for a classical (parametric) ANOVA. The residuals from the linear model (PC1 ~ Lot) were not normally distributed (Shapiro-Wilk p < 0.05), and several outliers appeared in lots A, B, and E. However, Levene’s test confirmed equal variances (p > 0.05).

library(ggstats)
library(rstatix)
library(ggstatsplot)
library(ggpubr)

# Outliers
df_scores %>% 
  select(Sample, Lote, PC1) %>% 
  group_by(Lote) %>% 
  identify_outliers(PC1)
# A tibble: 3 × 5
  Lote  Sample      PC1 is.outlier is.extreme
  <fct> <chr>     <dbl> <lgl>      <lgl>     
1 lot_A SampleA_2  1.73 TRUE       TRUE      
2 lot_B SampleB_2 -2.27 TRUE       FALSE     
3 lot_E SampleE_3 -2.25 TRUE       TRUE      
# Normality
fit <- lm(PC1 ~ Lote, data = df_scores)
shapiro_test(residuals(fit)) # p-value < 0.05, the residuals are not normally distributed
# A tibble: 1 × 3
  variable       statistic p.value
  <chr>              <dbl>   <dbl>
1 residuals(fit)     0.845 0.00139
ggqqplot(residuals(fit))

# Variance homogeneity
plot(fit, 1)

df_scores %>% 
  levene_test(PC1 ~ Lote) # p > 0.05, we cannot reject variance homogeneity
# A tibble: 1 × 4
    df1   df2 statistic     p
  <int> <int>     <dbl> <dbl>
1     4    20     0.100 0.981

Even after applying a 20% trimmed robust ANOVA, normality wasn’t achieved. So, a non-parametric ANOVA was used instead — better suited when normality assumptions are violated.

# Non-parametric tests
df_scores %>% 
  kruskal_test(PC1 ~ Lote) # the distribution across lots is not the same
# A tibble: 1 × 6
  .y.       n statistic    df        p method        
* <chr> <int>     <dbl> <int>    <dbl> <chr>         
1 PC1      25      18.7     4 0.000898 Kruskal-Wallis
# Effect size
df_scores %>% 
  kruskal_effsize(PC1 ~ Lote) # significant effect size
# A tibble: 1 × 5
  .y.       n effsize method  magnitude
* <chr> <int>   <dbl> <chr>   <ord>    
1 PC1      25   0.735 eta2[H] large    
# Post hoc tests
df_scores %>% 
  dunn_test(PC1 ~ Lote, p.adjust.method = 'bonferroni')
# A tibble: 10 × 9
   .y.   group1 group2    n1    n2 statistic       p  p.adj p.adj.signif
 * <chr> <chr>  <chr>  <int> <int>     <dbl>   <dbl>  <dbl> <chr>       
 1 PC1   lot_A  lot_B      5     5   -3.22   0.00127 0.0127 *           
 2 PC1   lot_A  lot_C      5     5   -0.988  0.323   1      ns          
 3 PC1   lot_A  lot_D      5     5   -0.0430 0.966   1      ns          
 4 PC1   lot_A  lot_E      5     5   -2.84   0.00457 0.0457 *           
 5 PC1   lot_B  lot_C      5     5    2.23   0.0255  0.255  ns          
 6 PC1   lot_B  lot_D      5     5    3.18   0.00148 0.0148 *           
 7 PC1   lot_B  lot_E      5     5    0.387  0.699   1      ns          
 8 PC1   lot_C  lot_D      5     5    0.945  0.345   1      ns          
 9 PC1   lot_C  lot_E      5     5   -1.85   0.0647  0.647  ns          
10 PC1   lot_D  lot_E      5     5   -2.79   0.00522 0.0522 ns          
ggbetweenstats(data = df_scores, x = Lote, y = PC1, type = 'np',
               bf.message = F, p.adjust.method = 'bonferroni')

The analysis revealed that lots B and E have significantly different PC1 values compared to the others. This suggests that their chemical profile differs, possibly due to changes in formulation or processing. Other lots showed similar patterns, indicating consistent production, while B and E deviate from the expected profile.

Conclusion:

This example shows how chemometrics can turn complex data like UV spectra into actionable insights. By detecting differences between batches without additional chemical tests, companies can improve quality control, save reagents, and make faster decisions. This enhances efficiency, reduces costs and time, and ultimately strengthens product quality and production reliability.

file.choose()
[1] "E:\\proyectos\\Spectros_With_R\\ACP_HPLC.html"